We will mainly use two packages during this workshop: DADA2 and Phyloseq. It is recommanded that you update your R software in order that these packages run smoothly. Indeed, in order to install DADA2, you must have R version 3.5.0 or newer, and Bioconductor version 3.7. Using Bioconductor is the easiest way to install both DADA2 and Phyloseq. You can check your actual R version by executing the version command in R.
versionIf your version is older than 3.5, follow the next steps. If not, skip to the Installing the packages section.
install.packages("installr")
library(installr)
updateR()What is nice with this package is that it should migrate the packages you’ve installed on your older version of R to the folder corresponding to the newest version of R. It also updates all the packages you’ve installed from the CRAN repository.
install.packages('devtools')
library(devtools)
install_github('andreacirilloac/updateR')
library(updateR)
updateR(admin_password = 'Admin user password')In order to migrate your packages from your old version to the new one: Follow this path: /Library/Frameworks/R.framework/Versions/ or /Bibliothèque/Frameworks/R.framework/Versions/ for the french speakers. Change the folder’s name to the 3.5.1 (latest R version).
You then want to update your packages using this command.
update.packages(checkBuilt=TRUE, ask=FALSE)This will only update the packages installed from the CRAN repository.
In order to check if everything went according to plan:
version
packageStatus()Now that you’ve got a brand new R, we’re going to install DADA2 and Phyloseq. For that, we must first install Bioconductor as we’re going to use one of its functions biocLite.
source("https://bioconductor.org/biocLite.R") # Installs Bionconductor
biocLite("dada2") # Installs DADA2
biocLite("phyloseq") # Installs phyloseq If this doesn’t work you can try other methods. Methods for DADA2 and Phyloseq.
If you don’t succeed, we’ll help you out on the first morning of the workshop.
Some packages include vignettes which are documents explaining the pipeline. Use the vignette() function
library(dada2) # Loads the DADA2 package in the environment
vignette("dada2")On R Studio you can type ?‘name’ or help(‘name’) For example:
library(dada2) # Loads the DADA2 package in the environment
?dada2
?filterAndTrim # One of DADA2's functions
help(filterAndTrim)If you’re just using R, use ??‘name’. It will open a page in your internet browser with the documentation on the function you want documentation on.
library(dada2) # Loads the DADA2 package in the environment
??dada2
??filterAndTrim # One of DADA2's functions
You can also see the source code of a function by just typing its name.
library(dada2) # Loads the DADA2 package in the environment
filterAndTrimWhenever you’re launching R, it selects a default directory (i.e a file in which you can save your script, your objects). The function getwd() will give you the path to the directory you are in. In order to change it, the function setwd() is used. You have to specify the path to the directory you want. Another usefull function is list.files() which will return all the files present in the directory you’re in.
getwd()
setwd("/Users/Jack/Documents/BIOME/")
list.files()When you run a pipeline or analysis, it is convenient to save the objects such as matrices or dataframes. By doing so, you won’t have to run your analysis every single time you want to view the object in question. In order to save an object the function saveRDS() can be used. The object can then be opened with the readRDS() function. This pair of functions are an alternative to the save() and load(). Their adventage is to allow the user to give a new name to the saved object when they load it.
mat <- matrix(sample(0:1, 12, replace=TRUE),3,4) # A 3 by 4 matrix containing 0s and 1s
saveRDS(mat, "Neo") # Save mat with Neo as a file name
the.matrix <- readRDS("Neo") # Load the file Neo but it will no longer be named mat but the.matrix in your environment
identical(mat, the.matrix ) # Checks if mat and my.matrix are identical
However if you want to save multiple objects at a time you’ll have to use save()
Reference 4
Phyloseq uses the S4 object system, which is one of the oriented object systems used in R. This system requires a bit of explanation. An S4 object (S4O) can be seen as multiple objects regrouped into one entity. Each object within the S4O has a class by three properties: a name, a representation giving their names and classes for example a car class would have representation(brand=“character”, horsepower=“numeric”) a charactor vector of classes that this class contains The class is created with the function setClass()
setClass("Person", representation(name = "character", age = "numeric"))
setClass("Employee", representation(boss = "Person"), contains = "Person")a new object within the class is created with new()
Neo <- new("Person", name="Neo", age = 31)To access slots of an S4 object, you’ll need to use @ and not $. The getSlots() returns a descritpion of the slots of a class.
Neo@age
getSlots("Person")The following commands are used for S4 generic functions rather than the usual ? or help calls:
getMethod(method,"Person") # prints source code for method
selectMethod(method, "Person") # will climb the inheritance to find method
showMethods(classes="Person") # show all methods for classThe more data you have, the slower your analyses are going to run. Therefore, it can be wise to subsample your data for this workshop. To do so, you can use the following script on the terminal. It first downloads the seqtk program. Place the program in the file where your fastq files are stored. Then you’ll use a loop in order to subsample reads within each sample. For the script to work your samples have to be named as such: SampleX_R1.fastq and SampleX_R2.fastq, X being a number identifying your samples. You will have to modify either the script or the names of your samples if they follow a different annotation. The subsampled reads will be placed in files named subsampleX_R1.fastq and subsampleX_R2.fastq (X being a number identifying your samples).
git clone https://github.com/lh3/seqtk.git; # Downloads the seqtk program
cd seqtk; make
for i in {1..10} ; # For sample 1 to sample 10 (you can adjust this to the number of samples you have)
do
./seqtk sample -s100 Sample${i}_R1.fastq 500 > subsample${i}_R1.fastq ;
#Takes 500 random reads in Sample${i}_R1.fastq and places them in the subsample${i}_R1.fastq file. Likewise you can adjust 500 to the number of reads you want to keep per sample.
./seqtk sample -s100 Sample${i}_R2.fastq 500 > subsample${i}_R2.fastq ;
# Same thing as above but for the reverse / R2 reads. It will take the reverse reads corresponding to the forward reads that have been selected above.
done